Skip to content

Fix MQTT reconnection loop in SatelliteController after broker disconnect#180

Open
daGrumpf-bxp wants to merge 1 commit into
PiBrewing:developmentfrom
daGrumpf-bxp:fix/mqtt-reconnection-satellite-controller
Open

Fix MQTT reconnection loop in SatelliteController after broker disconnect#180
daGrumpf-bxp wants to merge 1 commit into
PiBrewing:developmentfrom
daGrumpf-bxp:fix/mqtt-reconnection-satellite-controller

Conversation

@daGrumpf-bxp
Copy link
Copy Markdown

Summary

Fixes the MQTT auto-reconnection logic in SatelliteController.listen(). Currently, when the MQTT broker (e.g. mosquitto) restarts or has a brief outage, cbpi never reconnects and stays offline until a manual systemctl restart craftbeerpi is performed. This is critical for users who control actors via MQTT (e.g. ESP-based relay boards for fermentation control) since cbpi stops sending commands entirely.

Root cause

In cbpi/controller/satellite_controller.py, the listen() method has two related bugs:

  1. Exception ordering inverted: except Exception is placed before except aiomqtt.MqttError. Since MqttError inherits from Exception, the MqttError handler is dead code — Python evaluates except clauses in order, and Exception catches everything first.

  2. break on connection errors: Both except Exception and (the unreachable) except aiomqtt.MqttError blocks contained break, which exits the while True reconnection loop. The very first disconnect kills the loop permanently.

Combined, this means:

  • When mosquitto drops the connection, aiomqtt raises an MqttError (subclass of Exception)
  • The except Exception block catches it, logs an error, and breaks out of the loop
  • The coroutine ends, no further reconnection attempts are made
  • All subsequent publish() calls fail silently with [code:4] The client is not currently connected.

Reproduction

Tested on Raspberry Pi OS Bookworm, cbpi4 4.7.5, aiomqtt 2.5.1, mosquitto broker on localhost.

# Terminal 1: monitor cbpi
sudo journalctl -u craftbeerpi -f

# Terminal 2: monitor MQTT connections
watch -n 1 "sudo ss -tnp | grep :1883"

# Terminal 3: simulate broker outage
sudo systemctl stop mosquitto
sleep 5
sudo systemctl start mosquitto

Before the fix: cbpi disappears from ss output and never reappears. Log shows a single MQTT General Exception: Disconnected during message iteration then silence, followed by floods of Failed to push data via mqtt: [code:4] The client is not currently connected. warnings.

After the fix: cbpi logs MQTT disconnected: ... Reconnecting in 5s every 5 seconds while the broker is down, then automatically reconnects when mosquitto is back up. The ESTAB connection reappears in ss within seconds of the broker recovery.

Changes

Single file modified: cbpi/controller/satellite_controller.py

     async def listen(self):
         while True:
             try:
                 async with self.client as client:
                     await client.subscribe("#")
                     async for message in client.messages:
                         for topic_filter in self.topic_filters:
                             topic = topic_filter[0]
                             method = topic_filter[1]
                             if message.topic.matches(topic):
                                 await method(message)
             except asyncio.CancelledError:
                 # Cancel
                 self.logger.warning("MQTT Listening Cancelled")
                 break
-            except Exception as e:
-                self.logger.error("MQTT General Exception: {}".format(e))
-                break
             except aiomqtt.MqttError as e:
-                self.logger.error("MQTT Exception: {}".format(e))
+                self.logger.warning("MQTT disconnected: {}. Reconnecting in 5s".format(e))
+            except Exception as e:
+                self.logger.error("MQTT General Exception: {}. Reconnecting in 5s".format(e))

             await asyncio.sleep(5)

Three changes:

  1. except aiomqtt.MqttError is moved BEFORE except Exception so it actually fires.
  2. break is removed from both connection-error handlers, allowing the while True loop to retry.
  3. Log levels and messages are adjusted: MqttError becomes a warning (it's an expected condition during broker outages, not an error), and both messages now clearly state that a reconnect is being attempted.

The existing await asyncio.sleep(5) provides a 5-second back-off between retries.

asyncio.CancelledError is left untouched — its break is correct since it's the intended shutdown signal.

Testing

Manually tested on the reproduction setup above with:

  • mosquitto stop/start cycles
  • Broker network unreachable (firewall block)
  • Broker config errors at restart (auth changes)

In all cases cbpi recovers automatically within keepalive + 5s after the broker is reachable again.

Backward compatibility

No API or behavior changes other than the fix itself. The log line numbers shift slightly which may affect anyone grepping for satellite_controller.py:XX patterns, but log messages remain searchable by content.

The listen() loop had two bugs preventing reconnection:
1. except Exception was placed before except aiomqtt.MqttError,
   making the MqttError handler dead code (Exception catches all)
2. Both except blocks had 'break', exiting the while True loop
   on first disconnect, so the client never reconnected

Fix: swap except order (MqttError before Exception), remove the breaks
on connection errors. The existing asyncio.sleep(5) handles backoff.
@daGrumpf-bxp daGrumpf-bxp force-pushed the fix/mqtt-reconnection-satellite-controller branch from 4db7d46 to f464f92 Compare May 19, 2026 16:11
@avollkopf
Copy link
Copy Markdown
Member

Thank you for bringing this up as it was also anoying myself for some time and I never managed to fix it. I will look into it, and merge, once I have time. Probably next weekend

@avollkopf avollkopf self-assigned this May 19, 2026
@avollkopf avollkopf added the bug Something isn't working label May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants